ABSTRACT



Management of speech and audio prompts, and interface presence, in multimodal user interfaces is provided. A communications device having a multimodal user interface including a speech interface, and a non-speech interface, e.g. a graphical or tactile user interface, comprises means for dynamically switching between a background state of the speech interface and a foreground state of the speech interface in accordance with a user's input modality choice. Preferably, in the foreground state speech prompts and speech based error recovery are fully implemented and in a background state speech prompts are replaced by earcons, and no speech based error recovery is implemented. Thus there is provided a device which automatically subdues the speech prompts when a user selects a non-speech input/output mechanism. Also provided is a method for dynamic adjustment of audio prompts and speech prompts by switching from a foreground state to a background state of a speech interface in response to a user's current interaction modality, by selecting alternative states for speech and audio interfaces that represent a user's needs for speech prompts. This type of system and method is particularly useful and applicable to hand held Internet access communication devices.

35 Claims, 10 Drawing Sheets 
MANAGEMENT OF SPEECH AND AUDIO PROMPTS IN MULTIMODAL INTERFACES

RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 09/062,969 entitled "Server for handling multimodal information" to H. Pasternak et al.; and allowed U.S. patent application Ser. No. 09/063,007 entitled "Communication System User Interface with Animated Representation of Time Remaining for Input to Recognizer" to M. French St. George, filed concurrently herewith.

FIELD OF THE INVENTION

This invention relates to management of speech and audio prompts in multimodal interfaces for communications systems and devices, with particular application to management of speech prompt presence of a speech interface for accessing a speech recognizer.

BACKGROUND OF THE INVENTION

Speech prompted interfaces have been used in telecommunications systems in contexts where there is no visual display or the user is unable to use a visual display. Typically, a speech interface prompts the user when to speak by providing a speech prompt, i.e. a recognisable phrase or question prompting for user input, or by emitting a 'speak now' beep, i.e. an earcon. After the prompt, a speech recognizer is turned on for a limited time window, typically a few seconds, during which time the user may respond.

Telecommunications systems with a speech recognition capability have been in use for some time for performing basic tasks such as directory dialling. There are also network based speech recognition servers that deliver speech enabled directory dialling to any telephone. Typically, when these systems also offer a graphical user interface, i.e. a visual display, with a speech interface, in addition to a conventional tactile interface, i.e. a keypad, interfaces are discrete and non-integrated. That is, the system does not allow user tactile input and speech input at the same time.

Computer users have long been used to inputting data using a keyboard or drawing tablet, and receiving output in graphical form, i.e. visual information from a screen display which may include full motion, colour displays with supporting auditory 'beeps'. Speech processors for computers are now available with speech recognizers for receiving speech input, and converting to text, and speech processors for providing speech output. Typically, the speech processing is embedded within an application which is turned on and off by the user as required.

Speech output and speech recognition capability are being added to a variety of other electronic devices. Devices may be provided with tactile interfaces in addition to, or instead of, conventional keypads for inputting data. For example, there are a number of hand-held devices, e.g. personal organisers, that support pen input, for writing on a touch sensitive area of a display, and cellular phones may have touch-sensitive displays as well as a regular numeric keypad.

To overcome the inconvenience of switching between discrete applications offering different modes of interaction, systems are being developed to handle more than one type of interface, i.e. more than one mode of input and output simultaneously. In the following description the term input/output modality refers to a sensory modality relating to a user's behaviour in interacting with the system, i.e. by using auditory, tactile and visual senses. Input/output modes refer to specific examples of use of these modalities. For example speech and audio input/output represent an auditory modality; use of a keypad, pen, and touch sensitive buttons represent a tactile input modality, and viewing a graphical display relies on the visual modality.

An example of a multimodal interface is described in copending U.S. application entitled "Multimodal User Interface", filed Dec. 18, 1997, to Smith and Beaton, which is incorporated herein by reference. This application discloses a multi-modal user interface to provide a telecommunications system and methods to facilitate multiple modes of interfacing with users, for example, using voice, hard keys, touch sensitive soft key input, and pen input. This system provides, e.g. for voice or key input of data, and for graphical and speech data output. The user may choose to use the most convenient mode of interaction with the system and the system responds to input from all modes.

While interfaces for communications devices and computer systems are becoming increasingly able to accept input and provide output through various sensory modalities, existing systems and devices present some problems when the user tries to use particular input/output modalities according to the task at hand.

In using such an interface, for example, a user might request an item using speech, and then be presented with a list of choices on the screen, that requires some scrolling to access the relevant section of the list. At this point the user may choose to touch the scroll control and then touch an item on the list that they require.

Ideally the user wants to smoothly transition from one type of input/output modality to another, e.g. from a primarily speech input/output to a graphical and touch control structure. However there are problems with providing this transition in practice because there is an intrinsic conflict between speech interaction and graphical interaction styles. Current graphical interfaces are directed through a task by a user. Nothing happens unless a user clicks on a screen based object or types from a keyboard. The user maintains control of the interaction, and can pause and restart the task at any time.

In contrast, speech interfaces tend to direct a user through a task. The user initiates the interaction, and thereafter the speech recognizer prompts the user for a response, i.e. asks the user to repeat a name, etc. and expects an almost immediate input. As mentioned above, speech recognizers for communications devices typically operate within a limited time window, usually within a few seconds after a speech prompt. Thus, the timing of the listening window of the speech recognizer controls the requirement for the user to respond, to avoid an error or reprompting. Users often report feeling rushed when prompted to respond immediately after a beep or other speech prompt.

Natural language processors are known, which are on all the time, and thus can accept speech input at any time. However, these advanced speech recognizers require processing power of a network based system and are not yet widely used. Consequently, for most speech recognizers, there is a limited time window to respond after a speech prompt, and the user receives no indication of how long there is to respond.

In use of a multimodal interface, a user may feel particularly pressured after switching to a touch and/or graphical input/output mechanism, when the voice prompts remain active. A user who receives both graphical prompts and speech prompts, may be confused as to which is the appro-
priate mode to provide the next input, or may interpret dual prompts to be annoying or redundant.

In some systems, speech prompts may be manually turned on and off by the user to avoid this problem. However, this procedure introduces an intrusive, unnecessary step in an interface, necessitating that a user must remember to switch on the speech interface before providing speech input, and switch off before providing input by other modes.

Furthermore, manual switching on and off of the speech interface does not address management of speech based error recovery mechanisms. For example, if a user switches from speech input to pen input, and the speech interface remains on, and has the same sensitivity to detected speech input, a cumbersome error recovery mechanism may be invoked in cases where the recognizer was unable to detect spoken input, or was unable to interpret the detected spoken input despite the presence of a specific pen input.

SUMMARY OF THE INVENTION

Thus, the present invention seeks to provide a device, system and method for dynamic adjustment of speech and audio prompts in response to a user's current interaction modality which avoids or reduces some of the above mentioned problems.

According to a first aspect of the present invention there is provided a communications device having a multimodal user interface offering a user a choice of input modalities, comprising a speech interface for accessing a speech recognizer and a graphical user interface, and comprising means for dynamically switching between a foreground state of a speech interface and a background state of a speech interface in accordance with a user's input modality.

For example, where the user may choose amongst a plurality of input modalities comprising speech input and non-speech input such as tactile input, the device switches to a background state of the speech interface when a non-speech input mode is selected, and switches to a foreground state when speech input mode is selected.

Advantageously, in the foreground state of the speech interface, audio prompts and speech based error recovery are fully implemented. In the foreground state audio prompts may comprise speech prompts, earcons, or both earcons and speech prompts. Thus, the speech recognizer responds to a full vocabulary set and provides a full range of prompts to the user. In a background state speech prompts are replaced by a limited set of audio prompts, or earcons, and no speech based error recovery is implemented. In the background state of the speech interface, the speech recognizer therefore operates in a more conservative mode. For example the recognizer may respond only to a limited set of vocabulary, and not act on other sounds or speech inputs, and output only a limited set of audio prompts. Thus an appropriate set of audio prompts or speech prompts is selected depending on the user input modality. The user is then able to concentrate on non-speech interaction with the system by selected input/output modality, without distracting speech prompts.

Earcons are a limited or simplified set of audio prompts or recognizable sounds, in contrast to the speech prompts which include an appropriate vocabulary set.

Thus there is provided a device which automatically subdues the speech prompts when a user selects a non-speech input/output mechanism.

According to another aspect of the present invention there is provided a method for dynamic adjustment of audio prompts in response to a user's interaction modality in a communications system having a multimodal interface comprising a speech interface for accessing a speech recognizer and another interface, the method comprising:

after prompting a user for input and capturing input, generating an input identifier associated with the user input modality,

determining the mode of user input modality and selecting a corresponding one of a foreground state and a background state of the speech interface.

Advantageously, when the user's input modality is speech input, the method comprises selecting the foreground state of the speech interface. When the user's input modality is non-speech input, for example tactile input, the method comprises selecting a background mode of the speech interface. Preferably, in the foreground state of the speech interface, audio prompts, including speech prompts and/or earcons, and speech based error recovery are fully implemented; and in the background state of the speech interface, speech prompts are replaced by earcons, and no speech based error recovery is implemented.

When the user's input modality comprises more than one input mode, for example both speech input and tactile input are received, the method comprises selecting an appropriate foreground or background state of the speech interface, for example according to a precedence system or selection system based on a predetermined set of priorities associated with each input modality. Beneficially, the method provides for algorithms to determine appropriate selection of background or foreground state of the speech interface based on a sequence of previous inputs, or based on the needs of a user, and allowing for prediction of the most likely mode of the next input, according to the context of the application.

Thus the method provides for dynamically selecting alternative states for speech interfaces that represent a user's needs for speech prompts.

Another aspect of the invention provides software on a computer readable medium for carrying out these methods.

According to another aspect of the invention there is provided a system for dynamic adjustment of audio prompts in response to a user's interaction modality with a communications device having a multimodal interface comprising a speech interface for accessing a speech recognizer, and another interface, comprising:

means for determining the mode of user input modality and selecting a corresponding one of a foreground state and a background state of the speech interface.

Appropriate speech and/or audio prompts are thus selected. This type of switching mechanism for selection between two states of the speech interface is particularly useful and applicable to hand held Internet access devices.

With the development of multimodal interfaces, i.e. those in which users have a choice between a variety of input/output sensory mechanisms, the invention therefore provides a way to manage the 'presence' of speech prompts in contexts where users choose to use a non-auditory/non-speech interaction. There is provided automatic and dynamic control of which mode of interaction is used or takes precedence during a user interaction.

By automatically adjusting the 'presence' of speech prompts in speech interfaces, in accordance with a user's input modality choices, the resultant multi-modal interaction with a device is rendered more natural to the user, i.e. the user has greater sense of control, while still receiving an appropriate level of feedback from the system.

Thus, when the user selects a non-speech input/output mechanism, the speech recognizer switches to the background state.
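
The foreground/background selection summarised above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the invention: the names `Modality`, `State`, `SpeechInterface`, `on_user_input` and `prompt` are hypothetical, and the earcon is represented by a placeholder string.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Modality(Enum):
    SPEECH = auto()
    TOUCH = auto()
    PEN = auto()
    KEYPAD = auto()


class State(Enum):
    FOREGROUND = auto()
    BACKGROUND = auto()


@dataclass
class SpeechInterface:
    """Hypothetical model of the two speech-interface states."""
    state: State = State.FOREGROUND

    def on_user_input(self, modality: Modality) -> State:
        # Speech input selects the foreground state; any non-speech
        # input (touch, pen, keypad) selects the background state.
        if modality is Modality.SPEECH:
            self.state = State.FOREGROUND
        else:
            self.state = State.BACKGROUND
        return self.state

    @property
    def error_recovery_enabled(self) -> bool:
        # Speech based error recovery only runs in the foreground state.
        return self.state is State.FOREGROUND

    def prompt(self, text: str) -> str:
        # Foreground: full speech prompt. Background: earcon only.
        if self.state is State.FOREGROUND:
            return f"SPEECH: {text}"
        return "EARCON: beep"
```

A caller would pass every captured input through `on_user_input` before issuing the next prompt, so that the prompt style follows the modality the user has just chosen.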
In the background state of the interface, the speech recognizer is put into a conservative recognition state, in which the recognizer responds only to a very limited set of speech vocabulary, and response to extraneous noise, a cough, or other speech input is ignored. Thus, when the user selects keypad, pen or touch input the speech recognizer automatically goes into background state.

The speech recognizer will switch back to foreground state when it detects and recognizes valid speech from the user, or when activated by the user, or after the system issues speech prompts after which speech input is expected.

In hand held portable devices offering limited functionality, providing for switching between a foreground and background state of the speech interface, and selection of a background state offering more conservative audio prompts and speech recognition, may also have the benefit of conserving processing power required for operation of the speech interface, while allowing for full operation in the foreground mode of the speech interface when required.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described in greater detail with reference to the attached drawings wherein:

FIG. 1 shows a schematic block diagram of a communications network comprising a mobile telephone having a multitasking graphical user interface consistent with a first embodiment of the present invention;

FIG. 2 shows schematically an enlarged diagram of a mobile telephone of FIG. 1;

FIG. 3 shows further detail of part of a touch sensitive graphical user interface of the mobile telephone shown in FIG. 2 on an enlarged scale;

FIGS. 4 and 5 show a high level flow chart setting out the steps associated with a method for dynamically managing speech prompts according to an embodiment of the present invention, relating particularly to steps for processing user input at a multimodal interface of a telecommunications device;

FIGS. 6 to 10 show additional flow charts setting out further details of steps of the method for dynamically managing speech prompts according to this embodiment.

DETAILED DESCRIPTION OF THE INVENTION

A schematic block diagram of a communications network 10 is shown in FIG. 1 and represents a GSM switching services fabric 20 and a network services provider 40 associated with a plurality of communications terminals, for example a mobile telephone 100, and other wired or wireless communications devices and terminals represented schematically by units 110, 120, 130, 140 and 150.

The wireless mobile phone 100 according to a first embodiment of the present invention, shown enlarged in FIGS. 2 and 3, is provided with a multitasking graphical user interface, similar to that described in copending U.S. application Ser. No. 08/992,630 entitled "Multimodal User Interface", filed Dec. 19, 1997, to Smith and Beaton, which is incorporated herein by reference. This is a multi-modal user interface which provides a telecommunications system and methods to facilitate multiple modes of interfacing with users, for example, using voice, hard keys, touch sensitive soft key input, and pen input. This system provides, e.g. for voice or key input of data, and for graphical and speech data output. Thus the user may choose to use the most convenient mode of interaction with the system and the system responds to input from all modes.

Thus, as shown in FIG. 2, mobile telephone unit 100 comprises a body 200 which carries a display screen 210 for the graphical user interface, which may include touch sensitive buttons 212; a conventional keypad 220 and other hard keys 222; and a speaker 240 associated with the speech interface for providing speech prompts, as shown schematically by 230, to illustrate the various modes of interaction which may be selected by a user. Thus the unit provides for conventional key input, a graphical user interface which includes touch sensitive soft keys, and a speech interface comprising a speech prompt means to output audio prompts, and a speech recognizer to accept and interpret speech input.

Operation of the mobile phone unit 100 and the network is described in more detail in the above mentioned U.S. patent application.

The device 100 according to a first embodiment of the present invention comprises a multimodal user interface including a speech interface for speech input and output and for accessing a speech recognizer, and non-speech interfaces for tactile input and graphical output, shown on an enlarged scale in FIG. 3. The device 100 also comprises means for dynamically switching between a background state of the speech interface and a foreground state of the speech interface in accordance with a user's input modality choice.

Thus, for example a user may pick up the mobile phone, thus activating the unit, and turning on all default input/output modalities. The user may then select one of several modes of interaction with the unit. If for example the user initiates the interaction with a speech input, the speech interface is placed in the foreground state, i.e. it is turned on in the foreground state, or remains active in the foreground state.

On the other hand, if for example, the user uses the keypad, and/or soft keys on the graphical user interface, to initiate interaction, e.g. to obtain a directory listing which is displayed on the screen, by using tactile input to scroll through the directory listing, the user may then choose to complete the task by issuing a speech command to initiate dialling. Simply by touching one of the keys and thus selecting non-speech input as the initial step, to obtain the directory listing on the display, the user places the speech interface in background state, where in the conservative state of the recognizer, background noise, or conversation, coughing or extraneous sounds are unlikely to be recognized and acted upon by the recognizer. The user is therefore not bothered by unnecessary voice prompts, nor does the user have to be as careful not to make any extraneous noise.

However once the user has obtained a particular directory listing on the screen and scrolled to the appropriate listing, the user may choose to issue a speech command, i.e. 'dial John', by using one of the limited vocabulary set recognized by the recognizer in the background state, thus bringing the recognizer back into foreground state and allowing full voice recognition capability of the foreground state including speech prompting, and error recovery.

The user may thus proceed to interact via the speech interface until again inputting a command via the keys or tactile interface to initiate another process, thereby putting the speech interface back into background.

Introduction of adaptive speech recognizer parameters thus enables the recognizer to perform more conservative speech detection based on user input modality.

In the more conservative mode, the recognizer speech outputs are also muted, thus reducing the perceived presence of the recognizer functionality.

Multi-modal input buffers are provided that enable the system to wait until all user inputs have been received and interpreted. The wait time is dependent on the context of the application.
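
One way to picture a multi-modal input buffer with a context-dependent wait time is sketched below in Python. The description gives no concrete values or data structures, so everything here is an assumption made for illustration: the class and field names, the contexts in `WAIT_BY_CONTEXT`, the wait times, and the default of one second are all hypothetical. Timestamps are passed in explicitly so the sketch does not depend on a real clock.

```python
from dataclasses import dataclass, field

# Hypothetical wait times, in seconds, for each application context.
WAIT_BY_CONTEXT = {"directory_listing": 2.0, "dialling": 0.5}


@dataclass
class InputEvent:
    modality: str     # e.g. "speech", "touch", "pen"
    value: str
    timestamp: float  # seconds


@dataclass
class MultimodalBuffer:
    context: str
    events: list = field(default_factory=list)

    @property
    def wait_time(self) -> float:
        # Unknown contexts fall back to an assumed default of 1 second.
        return WAIT_BY_CONTEXT.get(self.context, 1.0)

    def add(self, event: InputEvent) -> bool:
        """Buffer the event if it arrives inside the window opened by the first event."""
        if self.events and event.timestamp - self.events[0].timestamp > self.wait_time:
            return False
        self.events.append(event)
        return True

    def ready(self, now: float) -> bool:
        """True once the wait window has elapsed and the inputs can be interpreted together."""
        return bool(self.events) and now - self.events[0].timestamp >= self.wait_time
```

Inputs that arrive within the window are held together, so that a later step can decide which of them takes precedence.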
Error recovery mechanisms are provided that permit the system to return a list of choices that mirror the multi-modal matches that the system has detected. The method may be implemented using decision making algorithms to allow the system to decide which inputs are most likely given the current context, i.e. based on the sequence of input steps previously used, and prediction of the most likely selection to allow checking for input modality to take place in the most efficient manner. The method may be implemented using sensing algorithms to enable the system to detect that the user has changed input modality. This type of adaptive interface is particularly applicable and advantageous in contexts where non PC based devices are used to access IP services, i.e. devices with limited functionality and processing power in the device itself, or where there is simplified functionality of a device which interacts with a network which provides more advanced functionality.

Enabling users to switch automatically and non-intrusively between input and output modalities in a natural way without overtly switching between modalities renders products easier to use. Enabling more natural interfaces to network based services is likely to become a competitive differentiator in the telecommunications terminals market.

Speech recognition may reside on the network or at the terminal device. The device and process of the embodiment thus facilitate speech access to information sources. It is believed to be particularly useful for hand held communications devices such as mobile phones and other specialized or portable terminals for information access and Internet based services, other than the example described in detail above, in which the device will prompt users for input, or inform users of the status of their interaction using a variety of graphical user interfaces, speech and auditory prompts, and correspondingly, users will be able to respond using touch, pen, keyboard, or speech.

A more detailed description of the method of operation of a system implementing an adaptive interface as described above will now be given with reference to a second embodiment of the invention, which relates to a Yahoo™ Yellow Page™ (YYP) directory application. The process will be described with reference to the flow charts shown in FIGS. 4 to 10.

In use of the YYP application, the user interacts with a communications device, which provides a speech interface and other interaction modalities including at least a tactile (touch) interface and a graphical user interface, in a similar manner to that described in the first embodiment. The device may be for example a mobile phone or other communications device, for example a portable hand held Internet access device, supported by a network communications system as described above. As mentioned above, the method is generally applicable to all systems that access speech recognition engines, whether these are implemented locally, or on a network server.

Firstly the system or device prompts the user for input, and receives input from the user by one of the number of possible interface modalities. The system generates an input identifier associated with the selected input sensory modality (FIG. 4). If more than one input is received within a predetermined time dependent on the user's context, the user is presented with a choice of valid next steps.

As indicated schematically in FIG. 5, the YYP application is a multi-layer application, and at each layer, prompts are selected associated with that layer, for example, the layers of this Yellow Pages directory are:

identify service,
identify city/state,
identify business category, and
display business name.

At each layer, the speech recognizer loads an appropriate vocabulary list with an appropriate set of prompts.

At the first layer, the appropriate vocabulary lists for the service layer are loaded.

Thus the first prompt will ask the user to 'identify the service'. On receiving input from the user, the system generates an input identifier associated with the respective type of each interface modality selected. After generating the input identifier based on the initial user input, the system determines the mode of user input. If both speech and touch interfaces are used, then the system defers to touch, unless there is a valid spoken vocabulary match. That is, the system detects and recognizes both [illegible]; the system [illegible] the outcome [illegible] and offers a choice to the user for the next input.

Following the steps in the flow chart of FIG. 6, after prompting for input, and determining the mode of input is not speech input, if speech recognition is enabled the speech interface will be switched into the background state.

Advantageously, the display should indicate the background state of the speech recognizer, i.e. by displaying an icon or message. In the background state the recognizer windows are at the usual time locations, i.e. the recognizer is turned on, but the recognizer will return a match only if there is a single highly probable spoken vocabulary match.

After confirmation that the interface is in background state, the system prompts the user, via the graphical interface, to select a layer.

In background state, the 'speak now' earcon is muted, and the speech prompts are turned off.

The background layer will also communicate a method of returning to selection of other services, a list of services available, and that the user may select a service by touching or, if appropriate, speaking the service name.

In response to each selection input by the user, an appropriate next choice is displayed.

For example if the service layer is selected the display will show: 'Service selection layer' and play an audio speech prompt associated with the service selection layer.

Otherwise the system will offer other selections as indicated in the flow chart of FIG. 6, and check for selection of each of these options.

For example when prompting the user to select the city/state layer, if this selection is indicated, this city/state layer prompt will be displayed and the appropriate vocabulary will be loaded. As indicated in the flow chart, for example, if the speech recognizer is on, corresponding speech prompts will be played.

If the speech recognizer is in background, an earcon, e.g. a beep indicated by (♦), will be played, to indicate that speech input will now be accepted, and the speech recognition window is opened.

Subsequently, the user will be offered selections for business category layer, business sub category layer and business layer in sequence. If one of these layers is selected, a [illegible] appropriate to the layer will be displayed and an audio or speech tag will be played, as appropriate.

For example, if the user selects the business category layer, the display will indicate that the user can write a business category, select a category from those displayed, or speak a business category, within the speech recognition time window.

On selection of the business sub-category layer, a user will select a business sub-category from those displayed by touch, or speak a selection from the business sub category within the time recognition window.

The user will be able to see the context of the sub- 
categories on the display. 

The user will also see a way of selecting to go up the 
business category hierarchy or go back to the city/state 
selection or select another city/state. 

The city/state will then be displayed. 

On selection of the business name, the display indicates
the business names associated with the user's business
category selection. The business names will be listed under the
leaf match.
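
The layered structure of the directory application, in which each layer carries its own prompt and vocabulary list, can be sketched as a small lookup table in Python. This is a hypothetical illustration only: apart from the first prompt, 'identify the service', the prompt wording and every vocabulary entry below are invented placeholders, and the names `Layer`, `LAYERS`, `load_layer` and `next_layer` do not come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    name: str
    prompt: str
    vocabulary: tuple  # words the recognizer would load for this layer


# Placeholder prompts and vocabularies, in the order the layers are visited.
LAYERS = (
    Layer("service", "Identify the service", ("yellow pages",)),
    Layer("city/state", "Identify the city and state", ("ottawa ontario",)),
    Layer("business category", "Identify the business category", ("restaurants",)),
    Layer("business sub-category", "Select a business sub-category", ("pizza",)),
    Layer("business name", "Select a business name", ()),
)


def load_layer(name: str) -> Layer:
    """Return the prompt and vocabulary to load when a layer is selected."""
    for layer in LAYERS:
        if layer.name == name:
            return layer
    raise KeyError(name)


def next_layer(current: str) -> Optional[Layer]:
    """Return the layer that follows `current`, or None at the last layer."""
    names = [layer.name for layer in LAYERS]
    index = names.index(current)
    return LAYERS[index + 1] if index + 1 < len(LAYERS) else None
```

Keeping the prompt and vocabulary together per layer means the same selection step can drive both the display and the recognizer.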

When speech recognition is enabled, the input mode is
checked to determine whether the speech recognition should
be on. When there is a prompt for input, and speech
input is not recognized, the screen displays a prompt
to select or write the next request.

Each time an input is received and an identifier is 
generated, the respective recognizer associated with the 
interface will seek a match. Where there is more than one 
match, a token will be returned for each match. The system 
must then determine which mode of input takes precedence. 
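
By way of illustration only, the precedence determination may be sketched as follows. The ranking of modalities, the tuple layout of a token and the function name are assumptions made for this sketch; the description does not fix a particular order of precedence.

```python
from typing import List, Optional, Tuple

# A token is modelled here as (modality, matched value); this layout
# is an assumption made for the sketch.
Token = Tuple[str, str]

# Hypothetical precedence table: the lowest number wins. The ranking
# itself is illustrative and is not taken from the description.
MODALITY_RANK = {"touch": 0, "pen": 1, "speech": 2}


def pick_by_precedence(tokens: List[Token]) -> Optional[Token]:
    """Return the token whose modality ranks first, or None if empty."""
    if not tokens:
        return None
    return min(tokens, key=lambda token: MODALITY_RANK[token[0]])


# One token was returned by the speech recognizer and one by touch.
chosen = pick_by_precedence([("speech", "Ottawa"), ("touch", "Toronto")])
```

A table of this kind keeps the precedence policy separate from the recognizers, so that a different policy, for example one dependent on the context of the application, may be substituted without altering the capture of input.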

The flow chart in FIG. 7 shows one embodiment of a
sequence of method steps for determining the various modes
of input so that the system selects the appropriate modality
for the next level of prompts, that is, either turns on speech
prompts or issues graphical prompts.

For example, as illustrated, the system gets input and
checks whether the speech recognizer is on. If the speech
recognizer is on, the spoken input is captured.

Similarly, the system also checks for input using other
input modalities. Inputs for each mode are captured and, as
the flow chart shows, identifiers for written, spoken and
touch input are generated, the input is processed, and tokens
representing a match for each input are processed by a
sequence of steps shown in FIG. 7.

The flow chart in FIG. 9 shows a sequence of steps for
recognizing input when there is speech input and when the 
speech recognizer is in background state. The flow chart in 
FIG. 10 sets out steps for determining time out failures and 
re-prompting the user for further input where speech input 
has not been understood, in other words a method for error 
recovery. 

As shown in the flow chart of FIG. 7, the input is obtained,
and the system determines if the speech recognizer is on. If
it is on, the speech input is captured, and the next prompt is
provided by the speech interface. Alternatively, if the speech
recognizer is off, the system determines whether a
'touch-to-talk' option is selected, i.e. receiving input from a touch
sensitive button on the display. If yes, the speech interface
is turned back on, into foreground state, and the system
issues a speech prompt for input. This option provides the
user with one way of turning the speech prompts back on.
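
By way of illustration only, the switching between the foreground state and the background state of the speech interface may be sketched as a small state holder. The event names, the class name and the choice of the foreground state as the initial state are assumptions made for this sketch.

```python
from enum import Enum


class SpeechState(Enum):
    FOREGROUND = "foreground"  # speech prompts and speech based error recovery
    BACKGROUND = "background"  # earcons only, no speech based error recovery


# Hypothetical event names, chosen for this sketch only.
SPEECH_EVENTS = {"speech", "touch_to_talk"}


class SpeechInterface:
    def __init__(self) -> None:
        # Starting in the foreground state is an assumption.
        self.state = SpeechState.FOREGROUND

    def on_input(self, event: str) -> SpeechState:
        """Update the state from the modality of the latest user input."""
        if event in SPEECH_EVENTS:
            self.state = SpeechState.FOREGROUND
        else:
            self.state = SpeechState.BACKGROUND
        return self.state


interface = SpeechInterface()
after_pen = interface.on_input("pen")              # user wrote with the pen
after_button = interface.on_input("touch_to_talk")  # user pressed the button
```

In this sketch any non-speech input subdues the speech prompts, and either recognized speech or the 'touch-to-talk' button restores them.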

The system then queries, in turn, if a touchable item, a pen
input, or a valid speech input is received. For each that is
identified or matched, a valid token is generated. If none
are generated and the wait for input exceeds the time out,
then the system either re-prompts or, for example, on the
second attempt if no valid input is received, leaves the
service layer and returns to another layer.
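
By way of illustration only, the time out and re-prompt behaviour may be sketched as a bounded retry loop. The function name, the `read_tokens` callable and the returned labels are hypothetical; the limit of two attempts follows the example given above.

```python
from typing import Callable, List, Tuple


def prompt_with_retry(
    read_tokens: Callable[[int], List[str]],
    max_attempts: int = 2,
) -> Tuple[str, List[str]]:
    """Prompt up to max_attempts times and stop at the first valid token.

    read_tokens(attempt) is a hypothetical callable that prompts the
    user, waits until the time out, and returns the valid tokens
    generated (an empty list if the wait timed out).
    """
    for attempt in range(1, max_attempts + 1):
        tokens = read_tokens(attempt)
        if tokens:
            return "proceed", tokens
    # No valid input after the last attempt: leave the service layer.
    return "leave_layer", []


# Simulated user who is silent on the first prompt and answers the second.
answers = {1: [], 2: ["Ottawa, Ontario"]}
outcome = prompt_with_retry(lambda attempt: answers[attempt])
```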

If valid tokens are returned for more than one mode, i.e.
touch input, pen input and speech input, the system again
determines if the speech recognizer is in background or
foreground state, switches the speech recognizer into
background state, returns tokens for each mode, and returns an
OK to proceed.

As shown in the flow chart of FIG. 8, if there is more than
one token returned, the system provides a subsequent screen
prompt for input, and captures input by the appropriate
mode, depending on whether the recognition window of the
speech recognizer is open or closed, and whether the speech
interface is in background state or foreground state.

If only one token is returned, the system seeks a
command word, i.e. one of a limited vocabulary set recognized
by the speech recognizer, which directs the system how to
proceed.

QUIT: quits yellow pages and returns to system idle
RETURN TO YELLOW PAGES: quits yellow pages and returns to service selection
SELECT OTHER SERVICE: returns to prompt for input at the same layer
NONE OF THESE: returns to the top of the yellow pages and requests city and state

If a command word is received, the system executes the 
command, or otherwise determines if there is a match to 
city/state to allow a move to the next layer, when there is a 
further prompt for input. 
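
By way of illustration only, the handling of a single returned token may be sketched as a look-up against the command words followed by a look-up against the known city/state entries. The action labels, the function name and the sample city/state entry are hypothetical; the command words are those listed above.

```python
# Command words from the list above, mapped to hypothetical action labels.
COMMANDS = {
    "QUIT": "system_idle",
    "RETURN TO YELLOW PAGES": "service_selection",
    "SELECT OTHER SERVICE": "reprompt_same_layer",
    "NONE OF THESE": "top_of_yellow_pages",
}

# Hypothetical directory data, for illustration only.
KNOWN_CITY_STATES = {"OTTAWA, ONTARIO"}


def next_step(token: str) -> str:
    """Decide how to proceed when exactly one token is returned."""
    word = token.strip().upper()
    if word in COMMANDS:
        return COMMANDS[word]              # execute the command
    if word in KNOWN_CITY_STATES:
        return "next_layer"                # city/state matched
    return "retain_city_state_prompt"      # keep asking for city and state


step = next_step("Ottawa, Ontario")
```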

Otherwise the system retains the screen prompt for
which city and state. The system then queries if the speech
recognizer is on or off (i.e. whether the speech recognition
window is open or closed). If the speech recognizer is off,
the system gets input, and if the speech recognizer is in
background then issues an audio prompt, e.g. a muted beep to
indicate the speech recognizer window opening, or if the speech
recognizer is in foreground issues a speech prompt, i.e. 'Say
the city and the state in which you want to search'.
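
By way of illustration only, the choice between an earcon and a speech prompt may be sketched as a function of the state of the speech interface. The function name and the returned tuple layout are assumptions made for this sketch; the prompt wording is taken from the example above.

```python
from typing import Tuple


def audio_prompt(state: str) -> Tuple[str, str]:
    """Return (kind, content) of the audio prompt for the given state."""
    if state == "background":
        # Earcon replacing the speech prompt, signalling that the
        # speech recognizer window is opening.
        return ("earcon", "muted beep")
    if state == "foreground":
        return (
            "speech",
            "Say the city and the state in which you want to search",
        )
    raise ValueError(f"unknown speech interface state: {state!r}")


kind, content = audio_prompt("background")
```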

After capturing input, if more than one token is returned,
then the screen prompts for speech recognizer on. If the
speech recognizer is off, the system will get input from other
modes; if the speech recognizer is in background, an audio
prompt (a muted beep for the speech recognizer window opening)
is given. If the speech recognizer is in foreground, then the
system queries if command words are detected. Again the
relevant set of command words as mentioned above directs
the system how to proceed if there is ambiguity of this sort.

Thus these commands, for example, direct the system to
reset or go to another layer. If no command words are
detected, a speech prompt is given: "More than one match
was found for your request, select the city and state from the
list displayed."

If the command words were detected, a speech prompt is
given: "More than one match was found to your request.
Select the city and state from the list or <command word>"
and further input will be sought.

When the speech recognizer is on, the input is accepted
and a determination is made whether the input is recognized.
If so, and SPEECH RECOGNIZER is not in background, all
recognized tokens and tags are returned.

If yes and the SPEECH RECOGNIZER is in background,
first the speech prompts are turned on, and then all
recognized tokens and identifiers are returned.

If the input is not recognized and if the speech recognizer
is not in background state, further input will be prompted
and sought. If SPEECH RECOGNIZER is in background
there will be a return with no match.
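
By way of illustration only, the four outcomes described above for speech input may be sketched as a decision on two conditions, namely whether the input was recognized and whether the speech recognizer is in background state. The sketch reads the first case above as the case in which the input is recognized; the function name and the action labels are hypothetical.

```python
from typing import List


def handle_speech_input(recognized: bool, in_background: bool) -> List[str]:
    """Return the ordered actions for one captured speech input."""
    if recognized:
        if in_background:
            # Recognized speech brings the speech prompts back first.
            return ["turn_on_speech_prompts", "return_tokens"]
        return ["return_tokens"]
    if in_background:
        # No speech based error recovery in background state.
        return ["return_no_match"]
    return ["reprompt_for_input"]


actions = handle_speech_input(recognized=True, in_background=True)
```

Written in this way, speech based error recovery (the re-prompt) is reached only in foreground state, which is consistent with the background state implementing no speech based error recovery.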

The flowchart in FIG. 10 sets out an example of an error
recovery route for a time out failure because input was
received too late or too early, and for input that is simply not
recognized.

When speech input is detected, it is preferable that there
is also a graphic display which reinforces the open speech
recognition window and displays visual feedback of 'sound
detected'.

When the screen displays all complete city/state matches
and command words that were matched, the screen also
displays a prompt 'Make choices'.

If there is a speech match to a city only, all valid city/state
combinations are returned, even if there was only a city
match. The user is then invited to choose the appropriate
city/state combination.

While the flowcharts set out specific methods for
determining priority of input modes, identifiers and tokens, and
for error recovery, it will be apparent that these represent
only one implementation of such methods by way of
example only and with reference to one specific directory
application exemplifying management of speech prompts by
switching between background and foreground states of a
speech recognizer.

Other applications using telecommunications devices
with multimodal user interfaces also benefit from the use of
dynamic management of a speech interface and speech
prompts as described above.

The embodiments were described with reference to a
network service based on a GSM switching network, and an
architecture based on a Java telephony application interface,
as described in the above referenced copending application
"Multimodal Interface" to Smith and Beaton. The services
described herein are preferably implemented using a
network server as described in copending U.S. patent
application entitled 'Server for handling multimodal information' to
Henry Pasternak, filed concurrently herewith, and which is
incorporated herein by reference. This server was developed
for handling information in different modal forms associated
with respective input/output modalities of a multimodal
interface, and may be used to implement the systems and
methods described herein.

Nevertheless, the device and methods described herein
may be implemented on other systems capable of supporting
multimodal interfaces, and are intended to be substantially
platform independent.

Thus, although specific embodiments of the invention
have been described in detail, it will be apparent to one
skilled in the art that variations and modifications to the
embodiments may be made within the scope of the
following claims.

What is claimed is:

1. A communications device having a multimodal user
interface offering a user a choice of input modalities,
comprising a speech interface for accessing a speech recognizer,
and a graphical user interface, and comprising means for
dynamically switching between a foreground state of a
speech interface and a background state of a speech interface
in accordance with a user's input modality.

2. A device according to claim 1 wherein, in the
foreground state audio prompts and full speech based error
recovery are implemented.

3. A device according to claim 2 wherein audio prompts
comprise one of speech prompts, earcons, and both speech
prompts and earcons.

4. A device according to claim 1 wherein in the
background state audio prompts comprising earcons, and no
speech based error recovery are implemented.

5. A device according to claim 3 wherein in the
background state speech prompts are replaced by earcons, and no
speech based error recovery are implemented.

6. A device according to claim 1 wherein,
in the foreground state, audio prompts comprising one of
speech prompts, earcons, and both speech prompts and
earcons are implemented and full speech based error
recovery is implemented,
and in the background state, audio prompts comprising
earcons replacing speech prompts, and no speech based
error recovery are implemented.

7. A device according to claim 6 wherein the speech
prompts are automatically subdued when a user selects a
non-speech input modality.

8. A device according to claim 1 wherein the device also
includes at least one of keypad interface, a tactile interface
and a pen input interface.

9. A device according to claim 1 comprising a mobile
telephone.

10. A device according to claim 1 comprising a hand held
Internet communication device.

11. A device according to claim 1 wherein the speech
recognizer is accessed on a network system.

12. A device according to claim 1 wherein the speech
recognizer is provided by the device.

13. A method for dynamic adjustment of audio prompts in
response to a user's interaction modality in a
communications system having a multimodal interface comprising a
speech interface for accessing a speech recognizer and
another interface, the method comprising:
after prompting a user for input and capturing input,
generating an input identifier associated with the user
input modality,
determining the mode of user input modality and selecting
a corresponding one of a foreground state and a
background state of the speech interface.

14. A method according to claim 13 comprising when the
user's input modality is speech, selecting the foreground state
of the speech interface,
and when the user's input modality is non-speech,
selecting a background mode of the speech interface.

15. A method according to claim 14 wherein
in the foreground state of the speech interface,
implementing audio prompts comprising one of speech
prompts, earcons, and earcons and speech prompts, and
implementing full speech based error recovery, and,
in the background state of the speech interface,
implementing audio prompts comprising earcons replacing
speech prompts, and implementing no speech based
error recovery.

16. A method according to claim 14, wherein in a
foreground state, audio prompts include speech prompts.

17. A method according to claim 13 wherein when the
user's input modality comprises more than one input mode,
the method comprises selecting an appropriate foreground
state or background state of the speech interface according
to a precedence system.

18. A method according to claim 13 wherein when the
user's input modality comprises more than one input mode,
the method comprises selecting an appropriate foreground
state or background state of the speech interface according
to a context of the application.

19. A method according to claim 13 wherein when the
user's input modality comprises more than one input mode,
the method comprises selecting an appropriate foreground
state or background state of the speech interface according
to a selection system.

20. A method according to claim 13 wherein when the
user's input modality comprises more than one input mode,
the method comprises selecting an appropriate foreground
state or background state of the speech interface based on
predetermined set of priorities associated with each input
modality.

21. A method according to claim 13 wherein the method
comprises determining an appropriate selection of the
background or foreground state of the speech interface based on
a sequence of previous user inputs.

22. A method according to claim 13 wherein selection of
the background or foreground state of the speech interface is
based on the expected needs of the user.

23. A method according to claim 13 wherein when the
user selected a non-speech input/output mechanism,
switching the speech interface to the background state.

24. A method according to claim 13 wherein in the
background state a speech recognizer of the speech interface
is put into a conservative recognition state, in which the
[illegible] respond only to a very limited [illegible] and
[illegible].

25. A method according to claim 23 wherein the
background state is selected until speech input is recognized by
the recognizer and then switching from the background state
to the foreground state of the interface.

26. Software on a computer readable medium for carrying
out a method for dynamic adjustment of speech prompts in
response to a user's interaction modality in a
communications system having a multimodal interface comprising a
speech interface for accessing a speech recognizer, and
another interface, the method comprising:
after prompting a user for input and capturing input,
generating an identifier associated with the user input
modality,
determining the user input modality and selecting a
corresponding one of a foreground state and a background
state of the speech interface.

27. A system for dynamic adjustment of audio prompts in
response to a user's interaction modality with a
communications device having a multimodal interface comprising a
speech interface for accessing a speech recognizer and
another interface, comprising: means for determining the
mode of user input modality and selecting a corresponding
one of a foreground state and a background state of the
speech interface.

28. A system according to claim 27 wherein the
foreground state of the speech interface provides audio prompts
comprising one of speech prompts, earcons, and earcons and
speech prompts, and provides full speech based error
recovery, and in the background mode of the speech
interface provides audio prompts comprising earcons replacing
speech prompts, and provides no speech based error
recovery.

29. A system according to claim 28, wherein in a
foreground state, audio prompts include speech prompts.

30. A system according to claim 27 comprising means for
selecting an appropriate foreground state or background
state of the speech interface according to a precedence
system when input is captured for multiple input modalities.

31. A system according to claim 27 comprising means for
selecting an appropriate foreground state or background
state of the speech interface based on a context of the
application when input is captured for multiple input
modalities.

32. A system according to claim 27 comprising means for
selecting an appropriate foreground state or background
state of the speech interface according to a selection system
when input is captured for multiple input modalities.

33. A system according to claim 27 comprising means for
selecting an appropriate foreground state or background
state of the speech interface based on predetermined set of
priorities associated with each input modality when input is
captured for multiple input modalities.

34. A system according to claim 27 comprising means for
selecting an appropriate foreground state or background
state of the speech interface based on a sequence of previous
user inputs.

35. A system according to claim 27 comprising means for
selecting an appropriate foreground state or background
state of the speech interface based on the expected needs of
the user.

* * * * *






